Udacity Data Analyst Nanodegree - P4: Explore and summarize data
by Gabor Galgocz
6 March 2016
========================================================
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
I find it very important to look into the origin of a dataset, to understand why it was created. The purpose of this dataset was to explore the effect that different chemical factors may have on the quality of the white variant of the Portuguese wine type called “Vinho Verde”. The dataset consists of observations of one specific wine type from Portugal, so we shouldn’t interpret the correlations found within the dataset as relevant to any other white wine type. Vinho Verde has a specific, characteristic flavour, and most probably the wine experts were looking for that flavour when they evaluated the wine’s quality. Other white wine types may have different correlations between their chemical characteristics and their perceived quality. More information on the Vinho Verde wine type: https://en.wikipedia.org/wiki/Vinho_Verde
The input variables are 11 chemical parameters of the tested wines; the output variable is the quality of the wine, as evaluated by at least 3 wine experts.
Let’s load the dataset and take a look at its variables, data types, structure and summary.
For a detailed description of the variables, we can check the original description of the dataset: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
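The output in this section could be produced along the following lines. This is a minimal sketch: the file name `wineQualityWhites.csv` and the helpers `load_wine()` and `inspect_wine()` are assumptions for illustration, not taken from the original report.

```r
# Hypothetical helper: read the white-wine CSV into a data frame.
# The default file name is an assumption; adjust it to your local copy.
load_wine <- function(path = "wineQualityWhites.csv") {
  read.csv(path, stringsAsFactors = FALSE)
}

# Hypothetical helper: print the variable names, the structure and the
# summary statistics, then return the data frame invisibly.
inspect_wine <- function(wine) {
  print(names(wine))
  str(wine)
  print(summary(wine))
  invisible(wine)
}

# Usage (requires the CSV file to be present):
# wine <- load_wine()
# inspect_wine(wine)
```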
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The dataset includes 4898 observations of 13 columns: the row index X, the 11 input variables and the output variable quality. The descriptions of the variables can be found above. The input variables are numeric, while X and quality are integers. In some cases I will treat quality as a factor variable, to make the charts more appropriate.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
I don’t have the necessary background in chemistry to point out any interesting parts of the chemical attributes like acidity, but a basic familiarity with pH and alcohol content tells me two things. The “Vinho Verde” wines in the sample are fairly acidic: the median pH is 3.18 and the minimum is 2.72, towards the lower end of the 3-4 range quoted above for most wines. Their alcohol content can also be higher (up to 14.2%) than that of usual wines (10-12%), although the median is only 10.4%.
Now let’s make a histogram of wine quality, which is our main focus of interest:
The histogram shows that the distribution of the quality scores is close to a normal distribution, although it is slightly skewed.
We can also take a look at the exact numbers to see how many data points are in each category. Looking at the data with both a histogram and a table is a good way to understand the distribution of the values and the finer details too. For example, on the histogram it’s not easy to see whether more wines belong to quality category 4 or 8. The table makes this easy: 163 wines scored 4 and 175 scored 8.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
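A table like the one above can be obtained with `table()`. The sketch below uses a handful of made-up scores rather than the real data, and the helper name `quality_counts()` is hypothetical. It also shows one way of treating quality as an ordered factor, which is an assumption about how the factor conversion mentioned earlier might be done.

```r
# Hypothetical helper: count the wines per quality score.
quality_counts <- function(quality) {
  table(quality)
}

# Toy scores (illustrative only, not taken from the dataset).
scores <- c(5L, 6L, 6L, 7L, 6L, 5L, 8L)
counts <- quality_counts(scores)
print(counts)

# An ordered factor keeps the levels in numeric order on charts.
quality_factor <- factor(scores, levels = sort(unique(scores)), ordered = TRUE)
print(levels(quality_factor))
```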
Another common way to visualize the distribution of one variable is the box plot. It shows the distribution in another way, by highlighting the median, the quartiles and the outliers.
Now I will check all the variables in our dataset to see the possible interesting distributions.
The distributions would be easier to see with adjusted binwidths, so I am going to do that as the next step.
It is also useful to remove the outliers, so I will remove the top and bottom 1% of the values.
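Removing the top and bottom 1% can be expressed as a quantile filter. The following is a minimal sketch under the assumption that the trimming is done per variable. The helper name `trim_outliers()` and the toy data are made up for illustration.

```r
# Hypothetical helper: keep only the rows whose value in `column` lies
# between the lower and upper quantiles (1% and 99% by default).
trim_outliers <- function(df, column, lower = 0.01, upper = 0.99) {
  limits <- quantile(df[[column]], probs = c(lower, upper), na.rm = TRUE)
  keep <- df[[column]] >= limits[1] & df[[column]] <= limits[2]
  df[keep, , drop = FALSE]
}

# Toy data with one extreme value at each end (not from the wine dataset).
toy <- data.frame(
  chlorides = c(0.001, seq(0.030, 0.060, length.out = 99), 0.346)
)
trimmed <- trim_outliers(toy, "chlorides")
print(range(trimmed$chlorides))
```

Note that trimming on one variable drops whole rows, so trimming several variables in sequence removes more than 2% of the observations overall.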
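Most variables have a roughly normal distribution, with the bulk of the values shifted slightly to the left and a longer tail to the right (in the summary above, the mean is slightly larger than the median for most variables). An interesting exception is residual sugar, which shows a bimodal distribution. Some variables have more visibly skewed distributions, so I am going to add a log10 scale to see them in more detail.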
Volatile acidity on a log10 scale shows a normal distribution, while alcohol content on a log10 scale shows an interesting, slightly bimodal distribution.
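The effect of a log10 scale can be sketched by binning the log10-transformed values. This is an illustration on made-up, right-skewed numbers. Binning the transformed values is assumed here as a stand-in for adding a log10 axis to the histogram, and base R is used only to keep the sketch free of extra packages.

```r
# Toy right-skewed values (illustrative only, not the real data).
volatile_acidity <- c(0.08, 0.15, 0.20, 0.22, 0.26, 0.28, 0.32, 0.45, 0.70, 1.10)

# Bin the raw and the log10-transformed values. plot = FALSE returns the
# bin counts without drawing; drop it to draw the histograms.
raw_hist <- hist(volatile_acidity, plot = FALSE)
log_hist <- hist(log10(volatile_acidity), plot = FALSE)

print(raw_hist$counts)
print(log_hist$counts)
```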
With a basic understanding of chemistry, I predict that pH, alcohol and residual sugar will be the most interesting variables to inspect.
At a later point in my investigation I created a “rating” variable, which transforms the quality variable into a categorical variable with only three values, to make the multivariate visualizations more effective.
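A three-level rating can be derived from quality with `cut()`. The cut points and the labels below are assumptions, since the report does not state them: scores up to 5 are treated as below average, 6 as average and 7 or more as above average. The helper name `make_rating()` is hypothetical.

```r
# Hypothetical helper with assumed cut points:
# 3-5 "below average", 6 "average", 7-9 "above average".
make_rating <- function(quality) {
  cut(quality,
      breaks = c(-Inf, 5, 6, Inf),
      labels = c("below average", "average", "above average"),
      ordered_result = TRUE)
}

rating <- make_rating(c(3, 5, 6, 7, 9))
print(table(rating))
```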
As described above, I also added a log10 scale to the alcohol and volatile acidity histograms, to better see their distributions.
After having finished the univariate exploration, it’s time to take a look at the correlations between variables. To start with, let’s plot all the variables in all their possible combinations, to see which combinations look interesting for further investigation. Using ggpairs, we can get an overview of all the plots.
Apparently we have too many variables for this visualization: the labels overlap a bit, and the scatterplots are not very useful because the points cover each other. Let’s investigate the variables in separate pairs. First, let’s see how the different variables correlate with wine quality, to decide which ones we should plot.
## [,1]
## fixed.acidity -0.113662831
## volatile.acidity -0.194722969
## citric.acid -0.009209091
## residual.sugar -0.097576829
## chlorides -0.209934411
## free.sulfur.dioxide 0.008158067
## total.sulfur.dioxide -0.174737218
## density -0.307123313
## pH 0.099427246
## sulphates 0.053677877
## alcohol 0.435574715
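A one-column correlation matrix like the one above can be computed with `cor()`. The sketch below is an assumption about how that might be done: the helper name `cor_with_quality()` is hypothetical and the toy data frame only stands in for the real dataset.

```r
# Hypothetical helper: correlate every input column with quality.
# Returns a one-column matrix with one row per input variable.
cor_with_quality <- function(wine, exclude = c("X", "quality")) {
  inputs <- setdiff(names(wine), exclude)
  cor(wine[, inputs, drop = FALSE], wine$quality)
}

# Toy data (illustrative only, not the real observations).
toy <- data.frame(
  X       = 1:5,
  alcohol = c(9, 10, 11, 12, 13),
  density = c(0.999, 0.997, 0.996, 0.993, 0.991),
  quality = c(5, 5, 6, 7, 7)
)
print(cor_with_quality(toy))
```

`cor()` computes Pearson coefficients by default, which measure linear association only.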
The correlation coefficients vary between positive and negative values, but none of them is close to 1 or -1, meaning that no variable correlates very strongly with wine quality. Still, let’s investigate the three variables which show the strongest correlations: alcohol (0.44), density (-0.31) and chlorides (-0.21). Since quality takes only a few discrete values, I will treat it as a categorical variable and use boxplots to visualize the relationship between quality and the other variables.
The boxplot of alcohol by quality shows an interesting pattern: the median values are lowest for average quality wine, somewhat higher for below average quality wine and considerably higher for above average quality wine. We should investigate this trend in a bit more detail. The boxplot doesn’t show how many wines belong to each quality category, though, so we should either check the histogram or add jitter at a later step to visualize the distribution of wines.
The boxplot of density by quality has some aspects that we should improve. The points strongly overlap each other, which makes it difficult to see whether some areas contain many points. There are also some outliers which we could remove. That would help us see more clearly, because currently most points are plotted in a small area of the graph.
The boxplot of chlorides by quality is similar to the previous one: the points overlap heavily, and the outliers squeeze the majority of the points into a small area.
Let’s add some jitter to the boxplots and also let’s remove the outliers where needed.
There are no outliers on this first boxplot, so here I only added some jitter to make the individual points more distinguishable. This makes it easier to see where most values are plotted, and the correlation is easier to identify now.
For the next two boxplots I also removed the bottom and top 1% of values, to get rid of the outliers.
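The two ingredients of these improved plots, per-category box statistics and jittered points, can be sketched in base R as follows. This is only an illustration on made-up values. The original plots were presumably drawn with a plotting package, and the outlier removal is left out here for brevity.

```r
# Toy data (illustrative only): quality scores and densities.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  density = c(0.998, 0.997, 0.996, 0.996, 0.995, 0.994, 0.993, 0.992, 0.991)
)

# Box statistics per quality level. plot = FALSE computes them without
# drawing; drop it to draw the boxplot.
bp <- boxplot(density ~ quality, data = toy, plot = FALSE)
print(bp$stats[3, ])  # row 3 holds the median of each quality level

# jitter() adds small uniform noise so that overlapping points separate.
set.seed(42)  # only to make the sketch reproducible
jittered_quality <- jitter(toy$quality, amount = 0.2)
print(round(jittered_quality, 2))
```

With `amount = 0.2` each point moves by at most 0.2 along the quality axis, so the points stay visually attached to their own category.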
Comparing the original version of the boxplots with the improved ones, it’s clear how much more recognizable the individual points and the overall trend are.
Plotting all the variables in one big view was not the most efficient way to explore the dataset. Processing all the data took a lot of time, and the individual charts were too small to discern meaningful details. Two clear correlations were visible: one between sugar and density (positive linear correlation), and one between alcohol and density (negative linear correlation). This suggests that I should look into the chemical processes by which sugar, alcohol and density influence one another, as there is a clear, strong relation between them.
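The direction of these two relationships can be checked with a small correlation matrix. The sketch below uses only the first four observations as printed in the str() output above (density is rounded there), so the coefficients illustrate the signs and say nothing reliable about the strength in the full dataset.

```r
# First four rows as printed in the str() output above.
first_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9)
)

# Pairwise Pearson correlations between the three variables.
cor_matrix <- cor(first_rows)
print(round(cor_matrix, 2))
```

On the full data frame the same call would be applied to the three corresponding columns.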
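As described above, the most interesting relationships between the other variables are those of density with sugar and with alcohol.
Inspecting the correlations, it is clear that alcohol content is the input variable that correlates most strongly with quality (0.44).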
We can add a third variable to the visualization using color coding. The alcohol content is displayed along the X axis and the pH value along the Y axis, while quality is encoded by color.
It is hard to distinguish between the various quality levels: apparently we have too many factor levels, and using only different hues of one single colour is not the most efficient way to visualize the data.
To make the visualization more efficient, we can group the quality scores into three categories and give each category its own color, so that the three categories are easy to identify.
It seems that pH is not really an important factor when it comes to the quality of wine, while the alcohol level is a lot more closely linked to it. It is very easy to see how the red points, marking wines of below average quality, are on the left side of the plot, while the high quality wines (green points) are on the right.
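A color-coded scatter plot of this kind can be sketched in base R as shown below. Red and green follow the description above. The grey used for the middle category, the category labels and the helper name `plot_by_rating()` are assumptions, and the data frame is expected to already contain a three-level `rating` column.

```r
# Red and green follow the text; grey for "average" and the labels are assumed.
rating_colors <- c("below average" = "red",
                   "average"       = "grey50",
                   "above average" = "green")

# Hypothetical helper: scatter plot of pH against alcohol, colored by rating.
# `wine` is assumed to have the columns alcohol, pH and rating.
plot_by_rating <- function(wine) {
  point_colors <- rating_colors[as.character(wine$rating)]
  plot(wine$alcohol, wine$pH, col = point_colors, pch = 16,
       xlab = "alcohol (% by volume)", ylab = "pH")
  legend("topright", legend = names(rating_colors),
         col = rating_colors, pch = 16)
  invisible(point_colors)
}

# Usage, given a data frame `wine` with a rating column:
# plot_by_rating(wine)
```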
We can explore other combinations of variables too, still using the quality as the color coded variable.
Visualizing pH values vs chlorides doesn’t reveal any strong correlations. Let’s try some other variables!
Plotting alcohol vs density reveals two patterns. We’ve seen earlier that there is a strong relation between alcohol and quality, but we see another pattern too: lower quality wines have a higher density, while higher quality wines have lower density.
The multivariate analysis reinforced what was becoming clear in the previous stage: alcohol content is the factor most strongly related to quality, and the other factors are less relevant. For this dataset, I think the multivariate plots allowed for a more stunning visualization of the correlations described earlier.
This step didn’t reveal any new findings that were not described by the previous examinations.
The first plot is a histogram, which explores the distribution of the quality of the wines. This gives the viewer a very good overview of the sample, showing a roughly normal distribution. It peaks at 6, which is both the most frequent score and the median. The categories 3 and 9 contain just a few values (20 and 5 wines respectively), so these might be considered outliers.
This boxplot shows the positive correlation between the alcohol content of the wine and its perceived quality: higher alcohol content tends to go with higher perceived quality. There are a few wines with quality 3 and 4 which apparently have a higher alcohol %, but the histogram (or the jitter added to the boxplot) makes it clear that there are only a few items here, and the majority of the wines show a clear positive trend.
This is the most stunning visualization of the relationship between alcohol content and perceived wine quality. Using color as a visual cue is a very useful way of communicating the key finding, and the introduction of rating categories improves it further. The human eye finds it easier to understand a visualization with a limited number of colors, and the rating categories helped to accomplish that. The added smoothing also reveals a slight positive correlation between alcohol content and pH, meaning that higher alcohol content goes with slightly higher pH values.
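The sign of the alcohol and pH relationship can be checked with a simple linear fit. The smoothing method used for the plot is not stated, so `lm()` is only a stand-in here. The data are the first four observations as printed in the str() output above, which is far too few to estimate the slope reliably.

```r
# First four rows as printed in the str() output above.
first_rows <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9),
  pH      = c(3.00, 3.30, 3.26, 3.19)
)

# Straight-line fit of pH on alcohol; a positive slope means that pH
# tends to rise with alcohol content in this small sample.
fit <- lm(pH ~ alcohol, data = first_rows)
slope <- coef(fit)[["alcohol"]]
print(slope)
```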
I found the dataset very interesting, and though my background knowledge in chemistry is limited, luckily the main finding didn’t require a deep understanding of the different kinds of acids and how they affect wine quality. The findings were surprising (at least for me, as I had never heard about this correlation before) and made me curious to read more about the topic. I think this is the main purpose of EDA: to use some simple methods to orient the research towards further directions.
Some questions that came to my mind during the exploration:
- Are these findings relevant to other kinds of wine too, or only to the Portuguese Vinho Verde?
- Apart from chemical factors, I find it reasonable to add other factors, mostly linked to meteorological data (rainfall and temperature across time) or to the chemical properties of the soil. I think these factors are also very important and may correlate strongly with the quality of wine.